{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# BioSentVec Tutorial"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This tutorial provides a basic introduction to our BioSentVec models. It covers (1) how to load the model, (2) an example function for preprocessing sentences, (3) an example application that uses the model, and (4) further resources for using the model more broadly."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Prerequisites"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Please download the BioSentVec model and install the required Python libraries (`sent2vec`, `nltk`, and `scipy`), which are imported below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sent2vec\n",
    "from nltk import word_tokenize\n",
    "from nltk.corpus import stopwords\n",
    "from string import punctuation\n",
    "from scipy.spatial import distance"
   ]
  },
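  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The preprocessing example in Section 3 relies on NLTK's tokenizer models and English stop-word list, which are downloaded separately from the `nltk` package itself. The cell below is a minimal sketch of fetching them, assuming network access. Which tokenizer resource is needed (`punkt` or `punkt_tab`) depends on your NLTK version, so the sketch requests both; we assume that a resource missing from your version's index is reported as a failed download rather than raising an error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nltk\n",
    "\n",
    "# 'punkt' / 'punkt_tab' hold the tokenizer models used by word_tokenize;\n",
    "# 'stopwords' holds the stop-word lists used by stopwords.words('english').\n",
    "for resource in ('punkt', 'punkt_tab', 'stopwords'):\n",
    "    ok = nltk.download(resource, quiet=True)\n",
    "    print(resource, 'downloaded' if ok else 'not available')"
   ]
  },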
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Load BioSentVec model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set `model_path` to the location of the BioSentVec model file. Loading the model for the first time may take a while."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "model successfully loaded\n"
     ]
    }
   ],
   "source": [
    "# Placeholder: replace with the path to your downloaded BioSentVec model file.\n",
    "model_path = 'YOUR_MODEL_LOCATION'\n",
    "model = sent2vec.Sent2vecModel()\n",
    "try:\n",
    "    model.load_model(model_path)\n",
    "except Exception as e:\n",
    "    print(e)\n",
    "else:\n",
    "    # Report success only when loading did not raise.\n",
    "    print('model successfully loaded')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Preprocess sentences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is no one-size-fits-all way to preprocess sentences. The code below is a representative example, consistent with the preprocessing approach we used when training the BioSentVec models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "stop_words = set(stopwords.words('english'))\n",
    "\n",
    "def preprocess_sentence(text):\n",
    "    # Pad slashes, periods and apostrophes with spaces so the tokenizer splits them off.\n",
    "    text = text.replace('/', ' / ')\n",
    "    text = text.replace('.-', ' .- ')\n",
    "    text = text.replace('.', ' . ')\n",
    "    text = text.replace('\\'', ' \\' ')\n",
    "    text = text.lower()\n",
    "\n",
    "    # Tokenize, then drop punctuation tokens and English stop words.\n",
    "    tokens = [token for token in word_tokenize(text) if token not in punctuation and token not in stop_words]\n",
    "\n",
    "    return ' '.join(tokens)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An example of using the `preprocess_sentence` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "breast cancers her2 amplification higher risk cns metastasis poorer prognosis\n"
     ]
    }
   ],
   "source": [
    "sentence = preprocess_sentence('Breast cancers with HER2 amplification have a higher risk of CNS metastasis and poorer prognosis.')\n",
    "print(sentence)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Retrieve a sentence vector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once a sentence is preprocessed, we can pass it to the BioSentVec model to retrieve a vector representation of the sentence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[ 0.27253592  0.04016513 -0.13868049  0.06607066  0.03410426  0.03702081\n",
      "   0.04780459  0.318374    0.1389506   0.14894584  0.03802885  0.16076139\n",
      "   0.27367333  0.28947747 -0.3635127   0.1523829   0.00113982  0.15947492\n",
      "  -0.00115095 -0.3911827   0.06040372 -0.30060792  0.5700456  -0.3073153\n",
      "   0.05641874 -0.38538572  0.03242918 -0.01758919 -0.53824794 -0.2036874\n",
      "   0.09088504  0.42208442  0.01777515  0.26457042  0.00444555 -0.4244185\n",
      "   0.08552625 -0.01220523 -0.52954006 -0.19729511  0.3146897   0.39812556\n",
      "  -0.73728865 -0.15572241  0.12493155 -0.189124    0.30150056 -0.13335498\n",
      "  -0.22929646  0.1923776  -0.25276372  0.48184827 -0.11678692  0.074292\n",
      "  -0.3565283   0.06902904 -0.16303737 -0.1516651  -0.16457589  0.2640424\n",
      "  -0.2330729   0.03231101  0.3361209   0.35289383 -0.23463576 -0.29648\n",
      "  -0.3083266   0.39252853 -0.24566592 -0.2444962   0.20645703 -0.04719147\n",
      "   0.10580424  0.00649089 -0.2572806  -0.333023   -0.03018534 -0.042082\n",
      "  -0.03446042  0.1267659   0.37817308 -0.38865507 -0.20552012 -0.34621498\n",
      "  -0.1216602  -0.04652812 -0.02347284 -0.24400087  0.16549529 -0.06411781\n",
      "   0.01422617 -0.12668294  0.5960534   0.02109158  0.16629732  0.17482263\n",
      "  -0.12253477  0.12936321  0.15015826  0.09612935  0.03910794 -0.09146566\n",
      "  -0.43439966 -0.07247142  0.26412925 -0.17527688 -0.13276757  0.20395164\n",
      "  -0.05921361  0.16484062  0.18909335 -0.09065875  0.14640309 -0.04357425\n",
      "  -0.31174526  0.13512115 -0.15826614  0.05530081 -0.32504323  0.20767705\n",
      "  -0.0941015  -0.01312271 -0.10174442  0.03745251 -0.14577436 -0.11705701\n",
      "   0.24673483 -0.29994592 -0.03089786  0.05221201  0.4669998   0.00442661\n",
      "   0.26255304 -0.0520683   0.24765283 -0.28208813  0.02101091 -0.2309345\n",
      "  -0.27185237 -0.06502334  0.04894404 -0.12051236  0.19054177  0.33650944\n",
      "   0.21696663 -0.09382363  0.04122993  0.02740302 -0.1523489   0.06898607\n",
      "   0.3646409   0.00145511  0.23472148  0.08020525 -0.06348559  0.38403732\n",
      "  -0.19614884  0.05751961  0.25427777 -0.06352087 -0.00501605  0.12870164\n",
      "  -0.12095109  0.01321181  0.30547306 -0.12648751 -0.22209032  0.20374553\n",
      "   0.30825672 -0.2554286   0.08377017  0.29781944 -0.29899627 -0.07965458\n",
      "  -0.18984668 -0.13143328  0.02504029 -0.15191814 -0.02388869 -0.06577343\n",
      "  -0.3067318  -0.07350196  0.07532682  0.04892201 -0.01745802 -0.10110494\n",
      "  -0.01072426 -0.01015269  0.232459   -0.32037356 -0.09599126 -0.09170408\n",
      "   0.07457744  0.10797482 -0.16640112 -0.02322571 -0.22640668 -0.32701987\n",
      "   0.29183394 -0.08617553  0.07842809  0.18027404  0.32083857 -0.29391503\n",
      "  -0.37091807  0.1317559  -0.04133325  0.02105725  0.06929144 -0.45651826\n",
      "   0.06086781  0.43626273  0.02961666  0.14899905  0.25753006 -0.15902318\n",
      "   0.27228698 -0.012717   -0.3683212   0.3146828   0.00246162 -0.2528494\n",
      "  -0.01773234 -0.16845916  0.15252198  0.05695811 -0.4019833   0.11004736\n",
      "  -0.4061533  -0.01654322 -0.08055133 -0.06888298  0.03975208 -0.12505263\n",
      "  -0.4766509  -0.1302982  -0.15458837  0.19499418 -0.20499982  0.01576013\n",
      "  -0.04100087 -0.03823095  0.01355971  0.31886473  0.3207466   0.30761683\n",
      "  -0.58859974 -0.05454841  0.09202857  0.07083365  0.12124014 -0.15489404\n",
      "  -0.1249956   0.19807188  0.02977463  0.06490497  0.0862143   0.09217765\n",
      "  -0.39212793 -0.1090506   0.3700054   0.2391053   0.1204542   0.06182214\n",
      "  -0.20115142 -0.19802506 -0.12779813  0.18747202  0.0733431  -0.09663613\n",
      "  -0.24403886  0.16471218 -0.12632118  0.3087525  -0.12539518 -0.02084013\n",
      "  -0.07293235 -0.38778466 -0.20683263  0.06490733  0.05344631 -0.28166145\n",
      "   0.0709516  -0.05099153  0.26141232 -0.02879013  0.3863618  -0.03771535\n",
      "   0.04465518  0.25189495 -0.05824171  0.04616535 -0.33440518  0.05650642\n",
      "   0.01963214  0.04899212 -0.12409336 -0.02178784 -0.02102915  0.02570555\n",
      "   0.13620213  0.01591191 -0.51012826 -0.11808088  0.16109395 -0.12763613\n",
      "  -0.09608371 -0.223153    0.10025517  0.110238    0.04289898  0.43777797\n",
      "  -0.07757877  0.3245564  -0.0072146   0.36475793 -0.23756203 -0.14881566\n",
      "   0.1897787  -0.22575381  0.32615083  0.16910845 -0.08788409 -0.07606266\n",
      "  -0.03706334  0.08212929 -0.19536538  0.19984807  0.04603511 -0.26996538\n",
      "   0.04950259 -0.03615545  0.1406415   0.2947527  -0.00611998 -0.05985891\n",
      "   0.01984618 -0.03949784 -0.01525426  0.29419264 -0.01415043 -0.17652188\n",
      "  -0.06262738 -0.22616321  0.25551927 -0.02472711  0.15726517 -0.14524549\n",
      "  -0.11207764  0.10489892  0.14721154  0.1193269  -0.0470333   0.08068092\n",
      "   0.06711143 -0.1101417   0.00740551  0.23555118 -0.04884436 -0.29348636\n",
      "   0.36853147  0.09429416  0.22065276  0.23430087 -0.0068337   0.06033167\n",
      "   0.14368132 -0.28589955  0.32065156 -0.02703334  0.14414166  0.11144061\n",
      "  -0.09757377  0.08389441  0.4110573  -0.17193225 -0.17498371  0.12369279\n",
      "  -0.17010431 -0.09807961 -0.07679521  0.13369125  0.13676417 -0.16726981\n",
      "   0.39855367  0.0587613   0.12028298  0.01342451 -0.07659346  0.03576399\n",
      "  -0.04420809  0.12297461  0.02851038 -0.01444774 -0.01379851 -0.08932398\n",
      "   0.28293097 -0.1373159   0.16300136  0.12364378 -0.2913006   0.25817928\n",
      "  -0.01344534 -0.24683551 -0.08785618 -0.1017781  -0.12594536 -0.17217784\n",
      "   0.12956655 -0.13296415  0.22922768  0.15616998 -0.2765172  -0.3030905\n",
      "   0.03086687 -0.00273167 -0.15588386 -0.05675261 -0.09152196  0.26230586\n",
      "  -0.01163875 -0.2478254   0.260964   -0.05098752 -0.02663371  0.08234623\n",
      "   0.34928283  0.8313451   0.02071937  0.24742903 -0.06239458  0.09169593\n",
      "   0.00140471 -0.06047087 -0.35359547  0.12234055  0.18345007 -0.14262569\n",
      "  -0.11202564  0.274945   -0.06307555  0.20897087  0.22961979 -0.31827667\n",
      "   0.12434521  0.09456863 -0.132533   -0.13584521 -0.36066884 -0.05460902\n",
      "  -0.14705043 -0.08507536  0.22685164  0.24383776  0.20274445  0.07966835\n",
      "  -0.06932851 -0.01657332 -0.35544744 -0.22558543 -0.0651169   0.08119379\n",
      "   0.3001279  -0.01761239 -0.01498686  0.11016284  0.2519153   0.02833793\n",
      "  -0.28951043  0.06437117 -0.25671995 -0.03743215 -0.22699313  0.24525918\n",
      "   0.04435244 -0.25781178  0.00997334 -0.07835439  0.22938563 -0.07016336\n",
      "  -0.24928015 -0.18942201 -0.1236209  -0.44305456  0.53566355 -0.18446858\n",
      "   0.30429277 -0.11268931 -0.11295509  0.25952902  0.19171143  0.07295282\n",
      "  -0.01309466  0.15677398 -0.1115496  -0.11746953 -0.34486744  0.01961437\n",
      "   0.08887484  0.1231166  -0.22707342  0.14050385  0.02042234 -0.27477872\n",
      "  -0.32859874  0.15609217 -0.15527791 -0.03412036 -0.13152814  0.5236449\n",
      "   0.19360445 -0.18125863  0.41408825  0.17874481 -0.0879835  -0.11195815\n",
      "  -0.08948261  0.23711275  0.10845808 -0.22963704  0.02916685  0.04244966\n",
      "  -0.0449315  -0.16313884 -0.36450905  0.06882233  0.10855233  0.3169161\n",
      "  -0.33788228 -0.11677711 -0.36983833  0.09579375 -0.02219467  0.11477247\n",
      "   0.02546611  0.08161401  0.08159067 -0.2501985  -0.23828559  0.36675447\n",
      "  -0.15668799  0.20695254  0.27773544  0.47669446  0.01058489  0.27333802\n",
      "  -0.39817995  0.23312205  0.11152606 -0.15429601 -0.21768859  0.02197697\n",
      "  -0.05461999  0.11158564 -0.3009951   0.04721674  0.33778647 -0.22506985\n",
      "   0.2090023   0.13018404 -0.17677754 -0.09073435 -0.157161   -0.12982582\n",
      "  -0.13903137  0.01262058 -0.06162163  0.11507569  0.21850152  0.09291503\n",
      "   0.13182876  0.02859347  0.12657352  0.3068309  -0.15490891 -0.04232102\n",
      "   0.062854   -0.15683283 -0.2431332  -0.20136073 -0.32315066  0.05642203\n",
      "  -0.16685694  0.24037287 -0.10076776 -0.15987408  0.04036417  0.06853651\n",
      "   0.06721435  0.09657718  0.21487527  0.04389333 -0.42330703 -0.12825093\n",
      "  -0.12326848 -0.26695827  0.0649719  -0.32621393 -0.09277593  0.04695158\n",
      "   0.16902225  0.12192411  0.02212488 -0.13833636  0.21684082 -0.15384167\n",
      "   0.00954215  0.21829392 -0.10491441  0.38043278 -0.08237162  0.22160071\n",
      "   0.07220576  0.3385922   0.18430929 -0.01216795  0.20997563  0.04614374\n",
      "   0.5460487  -0.02897776  0.14775318  0.31089064  0.27132967 -0.08209523\n",
      "   0.23873891 -0.06413503 -0.07715333 -0.02231805 -0.00694238  0.37205717\n",
      "  -0.1450972  -0.0704605  -0.02053621  0.11540693 -0.11201832 -0.1471214\n",
      "   0.04950135 -0.04224805  0.21448477 -0.22363718  0.02988946  0.07961679\n",
      "   0.02574715 -0.17271668  0.325553    0.01628166 -0.05568108 -0.3240605\n",
      "  -0.1429462   0.05608758 -0.01153869  0.03438982  0.08489512 -0.03345412\n",
      "  -0.04629951 -0.40246782  0.06087665  0.20731504 -0.20592833  0.2631903\n",
      "   0.12083606  0.03901361  0.22229938 -0.2662993   0.20107882 -0.20194705\n",
      "   0.12862273 -0.14036344 -0.23233241 -0.08034117  0.12506847 -0.1897902\n",
      "   0.0618707   0.15091741 -0.4029728  -0.10979341 -0.10763265  0.235283\n",
      "  -0.08089121  0.03753055  0.2415903  -0.33070192  0.03716518  0.33133337\n",
      "  -0.13763449 -0.0574756   0.32341847  0.10362037  0.12447642 -0.19017035\n",
      "   0.00549802  0.10385241  0.01570529 -0.11430962 -0.01734808 -0.10625661\n",
      "  -0.1896727   0.0568063   0.04407496  0.16548488]]\n"
     ]
    }
   ],
   "source": [
    "sentence_vector = model.embed_sentence(sentence)\n",
    "print(sentence_vector)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that you can also use embed_sentences to retrieve vector representations of multiple sentences."
   ]
  },
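  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the batch call, assuming `model` and `preprocess_sentence` from the cells above; the variable names `raw_sentences` and `sentence_vectors` are illustrative. `embed_sentences` takes a list of preprocessed sentences and returns one row per sentence:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_sentences = [\n",
    "    'Breast cancers with HER2 amplification have a higher risk of CNS metastasis and poorer prognosis.',\n",
    "    'Furthermore, increased CREB expression in breast tumors is associated with poor prognosis, shorter survival and higher risk of metastasis.',\n",
    "]\n",
    "\n",
    "# Preprocess each sentence the same way as in the single-sentence example, then embed them together.\n",
    "sentence_vectors = model.embed_sentences([preprocess_sentence(s) for s in raw_sentences])\n",
    "print(sentence_vectors.shape)  # (number of sentences, embedding dimension)"
   ]
  },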
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The shape of the vector representation depends on the dimension parameter. In this case, we set the dimension to 700: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(1, 700)\n"
     ]
    }
   ],
   "source": [
    "print(sentence_vector.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Compute sentence similarity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section, we demonstrate how to compute the similarity between a pair of sentences using the BioSentVec model. We first use the code examples above to obtain vector representations of the sentences, then compute the cosine similarity between the two vectors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "cosine similarity: 0.9813870787620544\n"
     ]
    }
   ],
   "source": [
    "sentence_vector1 = model.embed_sentence(preprocess_sentence('Breast cancers with HER2 amplification have a higher risk of CNS metastasis and poorer prognosis.'))\n",
    "sentence_vector2 = model.embed_sentence(preprocess_sentence('Breast cancers with HER2 amplification are more aggressive, have a higher risk of CNS metastasis, and poorer prognosis.'))\n",
    "\n",
    "# embed_sentence returns an array of shape (1, dim); index row 0 so distance.cosine receives 1-D vectors.\n",
    "cosine_sim = 1 - distance.cosine(sentence_vector1[0], sentence_vector2[0])\n",
    "print('cosine similarity:', cosine_sim)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is another example for a pair that is relatively less similar."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "cosine similarity: 0.7300089001655579\n"
     ]
    }
   ],
   "source": [
    "sentence_vector3 = model.embed_sentence(preprocess_sentence('Furthermore, increased CREB expression in breast tumors is associated with poor prognosis, shorter survival and higher risk of metastasis.'))\n",
    "# Index row 0 of each (1, dim) array so distance.cosine receives 1-D vectors.\n",
    "cosine_sim = 1 - distance.cosine(sentence_vector1[0], sentence_vector3[0])\n",
    "print('cosine similarity:', cosine_sim)"
   ]
  },
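  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, `distance.cosine` returns the cosine distance, so subtracting it from 1 gives the cosine similarity: the dot product of the two vectors divided by the product of their Euclidean norms. The sketch below spells this out in NumPy on toy vectors and prints the SciPy result next to it for comparison. `cosine_similarity` is a hypothetical helper written for this illustration, not part of sent2vec or SciPy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from scipy.spatial import distance\n",
    "\n",
    "def cosine_similarity(u, v):\n",
    "    # Flatten so both (dim,) and (1, dim) arrays are accepted.\n",
    "    u = np.asarray(u, dtype=float).ravel()\n",
    "    v = np.asarray(v, dtype=float).ravel()\n",
    "    # Dot product divided by the product of the Euclidean norms.\n",
    "    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))\n",
    "\n",
    "# Toy vectors for illustration only (not BioSentVec output).\n",
    "a = np.array([1.0, 2.0, 3.0])\n",
    "b = np.array([2.0, 4.0, 6.5])\n",
    "print('NumPy:', cosine_similarity(a, b))\n",
    "print('SciPy:', 1 - distance.cosine(a, b))"
   ]
  },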
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. More resources"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above example demonstrates an unsupervised way to use the BioSentVec model. In addition, we summarize a few useful resources:\n",
    "\n",
    "#### (1) The Sent2vec homepage (https://github.com/epfml/sent2vec) provides several pre-trained sentence embedding models trained on general English corpora.\n",
    "#### (2) You can also develop deep learning models to learn sentence similarity in a supervised manner.\n",
    "#### (3) You can also use BioSentVec in other applications, such as multi-label classification."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reference"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you use any of our pre-trained models in your application, please cite the following paper:\n",
    "\n",
    "Chen Q, Peng Y, Lu Z. BioSentVec: creating sentence embeddings for biomedical texts. 2018. arXiv:1810.09302."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
